Extending Huffman Coding for Multilingual Text Compression
Authors
Abstract
Traditional text compression algorithms such as Huffman coding and the LZ variants are usually based on 8-bit character sampling. However, under the Unicode representation of multilingual information, the character set of a language such as Chinese or Japanese consists of a very large number of distinct characters, so 16-bit or 32-bit character sampling is needed. Consequently, when text compression algorithms based on 8-bit character sampling are applied to documents that use 16-bit or 32-bit character sampling, a very poor data compression ratio (about 1.5 on average) is obtained. In this paper, we propose two new algorithms that are based on the 16-bit or 32-bit sampling character set and on the unique features of languages with a large number of distinct characters, in order to significantly improve data compression ratios for multilingual text documents. We choose Chinese with 16-bit character sampling (such as Big-5 or GB code) as the representative language in our study. The first approach, called Static Chinese Huffman Coding, introduces the concept of a single Chinese character into the Huffman tree. Experimental results on our PH corpus show that the improvement in compression ratio obtained by this approach ranges from 20% to 29%. The second approach, called the Dictionary-Based Chinese Huffman Coding
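The contrast between 8-bit and 16-bit symbol sampling described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it is plain static Huffman coding run twice over the same text, once with bytes as symbols and once with whole characters as symbols. The sample string, the choice of UTF-16 (big-endian) as the 16-bit encoding, and the function names `huffman_code_lengths` and `encoded_bits` are assumptions made for illustration, and the cost of storing the code table is ignored.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a static Huffman code."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        # A lone symbol still needs one bit per occurrence.
        return {next(iter(freq)): 1}
    tie = count()  # tie-breaker so the heap never compares dicts
    heap = [(n, next(tie), {s: 0}) for s, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, d1 = heapq.heappop(heap)
        n2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every leaf one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (n1 + n2, next(tie), merged))
    return heap[0][2]


def encoded_bits(symbols):
    """Total payload size in bits when `symbols` is Huffman-coded."""
    symbols = list(symbols)
    lengths = huffman_code_lengths(symbols)
    return sum(lengths[s] for s in symbols)


# Hypothetical sample text; every character here takes 2 bytes in UTF-16.
text = "中文文本压缩与多语言文本" * 50
raw = text.encode("utf-16-be")

byte_bits = encoded_bits(raw)    # 8-bit sampling: symbols are bytes
char_bits = encoded_bits(text)   # 16-bit sampling: symbols are characters

print("raw bits:          ", len(raw) * 8)
print("byte-level Huffman:", byte_bits)
print("char-level Huffman:", char_bits)
```

When every character occupies exactly two bytes, the character-level payload can never be larger than the byte-level one, because concatenating the two byte codewords of a character is itself a valid prefix code over characters. On real corpora the larger code table of the character-level model would have to be counted as well.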
Similar resources
Performance Improvement Of Bengali Text Compression Using Transliteration And Huffman Principle
In this paper, we propose a new compression technique based on transliteration of Bengali text to English. Compared to Bengali, English is a less symbolic language. Thus, transliterating Bengali text to English reduces the number of characters to be coded. Huffman coding is well known for producing optimal compression. When the Huffman principle is applied to transliterated text significant perfo...
Data Compression Considering Text Files
Lossless text data compression is an important field, as it significantly reduces storage requirements and communication costs. This work focuses mainly on different file compression coding techniques and on comparisons between them. Some memory-efficient encoding schemes are analyzed and implemented in this work. They are: Shannon-Fano Coding, Huffman Coding, Repeated Huffman Codin...
Text Compression Algorithms - a Comparative Study
Data compression may be defined as the science and art of representing information in a crisply condensed form. For decades, data compression has been one of the critical enabling technologies for the ongoing digital multimedia revolution. Many data compression algorithms are available to compress files of different formats. This paper provides a survey of different...
Efficient Data Compression Scheme using Dynamic Huffman Code Applied on Arabic Language
Developing an efficient compression scheme for the Arabic language is a difficult task. This paper applies dynamic Huffman coding, with variable-length bit coding, to data compression for the Arabic language. Experimental tests have been performed on both Arabic and English text. A comparison is made to measure the efficiency of compressing data results on both Arabic a...
Comparative study of Arithmetic and Huffman Compression Techniques for Enhancing Security and Effective Bandwidth Utilization in the Context of ECC for Text
In this paper, we propose a model for text encryption that uses elliptic curve cryptography (ECC) for secure transmission of text and incorporates the Arithmetic/Huffman data compression technique to use channel bandwidth effectively and to enhance security. In this model, every character of the text message is transformed into the elliptic curve points